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Abstract — We present a 0.5 Petaflop/s calculation of homoge- 
neous isotropic turbulence in a cube of 2048^ particles, using a 
highly parallel fast multipole method (FMM) using 2048 GPUs 
on the TSUBAME 2.0 system. We compare this particle-based 
code with a spectral DNS code under the same calculation 
condition and the same machine. The results of our particle- 
based turbulence simulation match quantitatively with that of 
the spectral method. The calculation time for one time step is 
approximately 30 seconds for both methods; this result shows 
that the scalability of the FMM starts to become an advantage 
over FFT-based methods beyond 2000 GPUs. 

I. Introduction 

For the past few decades, the preferred method for the 
simulation of homogeneous isotropic turbulence in a peri- 
odic cube has been the pseudo-spectral method. The largest 
direct numerical simulation of isotropic turbulence to date 
was performed by Ishihara et al. [JJ, [2] using 4096^ grid 
points at a maximum Taylor-microscale Reynolds number 
R\ w 1130. This record-breaking simulation was done on 
Earth Simulator — a large, shared-memory system that can 
efficiently perform large-scale FFT. The successive generations 
of supercomputers have not been so FFT-friendly, and this 
record has not been broken even though the peak performance 
of supercomputers has increased nearly 50-fold since then. 

Future high-performance computing systems will have ever 
more nodes, and ever more cores per node, but will probably 
not be equipped with the bandwidth required by many popular 
algorithms to transfer the necessary data at the optimal rate. 
This situation is detrimental to parallel scalability. Therefore, 
it is becoming increasingly important to look at alternative 
algorithms that may achieve better performance on these 
extremely parallel machines of the future. 

In most standard methods of incompressible CFD, the 
greatest fraction of the calculation runtime is spent solving 
a Poisson equation. Equations of this type can be efficiently 
solved by means of an FFT-based algorithm, a sparse linear 
solver, or a fast multipole method (FMM). For the sake of our 



argument, we will not differentiate between FFT-based Poisson 
solvers and pseudo-spectral methods because they both rely 
on FFT. The fast multipole method has been less popular 
due to the fact that it is substantially slower — depending on 
implementations, around an order of magnitude slower — than 
FFT and multigrid solvers when compared using a small CPU 
cluster. We aim to show with this ongoing study that the 
relative performance of FMM improves as one scales to large 
numbers of GPUs. 

The highly scalable nature of the FMM algorithm, among 
other features, makes it a top contender in the algorithmic 
toolbox for Exascale systems. One point of evidence of this 
argument is given by the winning Gordon Bell entry of 
last year IS], which achieved 0.7 Petaflop/s with an FMM 
algorithm on 200k cores of the Jaguar system at Oak Ridge 
National Laboratory. In the previous year, FMM also figured 
prominently at the Supercomputing Conference, with a paper 
among the finalists for the Best Paper award (4) and the 
Gordon Bell prize in the price/performance category going to 
work with hierarchical A^-body methods on GPU architecture 

m. 

In fact, the FMM algorithm is well adapted to the architec- 
tural features of GPUs, which is an important consideration 
given that it is considered an architecture of likely dominance 
as we move towards Exascale. The work presented in 2009 — 
winner in the price/performance category in great measure 
thanks to the ingenious and resourceful system design using 
gaming hardware — ^reported (at the time of the conference) 80 
Teraflop/s.|5J This work progressed to an Honorable Mention 
in the 2010 list of awardees, with 104 Teraflop/s. |6| 

At the level of the present work, where we present a 0.5 
Petaflop/s calculation of homogeneous isotropic turbulence, 
the FMM moves firmly into the arena of Petascale GPU 
computing. The significance of this advance is that we are 
now in a range where the FMM algorithm shows its true 
capability. The scalability of FMM using in the order of 



1000 GPUs shows advantage over the dominance of FFT- 
based algorithms. Showcasing the FMM in a simulation of 
homogeneous isotropic turbulence is then especially fitting, 
given that a years-old record there remains unchallenged. 
Based on our results, we are confident that a turbulence 
simulation repeating the 4096"^ record, but with an FMM-based 
algorithm on GPUs, will be attained soon. Conceivably, we 
will see this nine-year old record finally broken using GPU 
hardware. 

II. Parallel Fast Multipole Method on GPUs 

A. Tree Partitioning 

When parallelizing hierarchical N-body algorithms, the fact 
that the particle distribution is dynamic makes it impossible to 
precompute the decomposition and to balance the work-load 
or communication a priori. Warren and Salmon Q developed 
a parallel algorithm for decomposing the domain into recursive 
subdomains using the method of orthogonal recursive bisection 
(ORB). The resulting process can be thought of as a binary 
tree, which splits the domain into subdomains with equal 
number of particles at every bisection. 

Another popular technique for partitioning tree structures 
is to use Morton ordering fW\, where bits representing the 
particle coordinates are interleaved to form a unique key that 
maps to each cell in the tree. Following the Morton index 
monotonically will take the form of a space filling curve in the 
shape of a "Z". Equally partitioning the list of Morton indices 
assures that each partition will contain an equal number of 
cells, regardless of the particle distribution. 

For an adaptive tree, a naive implementation of the Morton 
ordering could result in large communication if the "Z" is split 
in the wrong place. Sundar et al. \9\ proposed a bottom-up 
coarsening strategy that ensures a clean partition at the coarse 
level while using Morton ordering. On the other hand, ORB 
always partitions the domain into rectangular subdomains, 
and works fine with treecodes. The problem with ORB for 
FMM usually occurs when one attempts to use square oct-trees 
with ORB instead of using the rectangular recursive bisection 
throughout the whole tree. The stack based treewalk described 
in the following subsection permits the use of rectangular 
bisections throughout the whole tree, while enabling cell-cell 
interactions in the FMM with minimum complications. 

Our current partitioning scheme is an extension of the 
ORB, which allows multisections instead of bisections. We 
developed a parallel version of the "nth-element" algorithm to 
find the global median (N/2-th element). The same algorithm 
can be used to find the N/3-th element as well, and there- 
fore the extension to multisections was straightforward. This 
recursive multisection allows efficient partitioning for number 
of processes that are not a power of two. 

B. Stack Based Treewalk 

The 0{N) treewalk fTOl enables cell-cell interactions in the 
traditional treecode framework. Unlike conventional treecodes, 
the 0{N) treewalk is not performed per target cell. We 
describe the 0{N) treewalk in Algorithm [T] which calls an 



internal routine for the interaction of a pair of cells, given in 
Algorithm |2] First, a pair of cells is pushed into the stack. 
It is most convenient to start from the root cell although 
there is no chance of two root cells interacting. For every 
step in the while-loop, a pair of cells is popped from the 
stack and the larger cell is subdivided. Then, Algorithm [2] 
is called to perform either particle-particle (P2P), multipole- 
particle (M2P), or multipole-local (M2L) interactions. Unlike 
a pure treecode which calculates only P2P and M2P, or a 
pure FMM which calculates only P2P and M2L the current 
algorithm chooses whichever operation is faster If none of 
the above operations are worth doing at this stage of the 
subdivision, the pair is pushed to the stack and will be handled 
later. It is a trivial matter to run each of these kernels at the 
beginning of the calculation and to auto-tune the criteria for 
choosing which kernel is faster for a given pair of cells. This 
turned out to be extremely useful for auto-tuning the kernel 
selection on CPUs and GPUs simultaneously. 



Algorithm 1 Evaluate() 
A=B=rootcell 

push pair(A,B) into a stack 
while the stack is not empty do 
Pop stack to get a pair(A,B) 
if radius of A > radius of B then 
for all child cells "a" of cell A do 

Interact(a,B) 
end for 
else 

for all child cells "b" of cell B do 

Interact(A,b) 
end for 
end if 
end while 



Algorithm 2 Interact(cell A, cell B) 

if there are very few particles in both cells, or they don't 
have child cells then 

Evaluate particle-particle (P2P) 
else if cells A and B are well separated then 

if there are very few particles in cell A, or there are no 

child cells in cell A then 

Evaluate multipole-particle (M2P) 

else 

Evaluate multipole-local (M2L) 
end if 
else 

Push pair(A,B) into a stack 
end if 



We will give a simple example of what will typically happen 
when this algorithm is executed. During the initial stages of 
the while loop in Algorithm [T] the cells will have too many 
particles to perform P2P, and will be too close to each other 



to perform M2P/M2L, so the pairs will be pushed to the 
stack. After a few steps, the pairs will become well separated 
and M2L interactions will be performed. If the distribution 
of particles is highly non-uniform and the tree has a very 
uneven structure, some cells will contain very few particles 
and M2P interactions will take place. At the final stage of the 
treewalk, the cells will reach the terminal level where there 
are no more subcells to divide into, and P2P interactions will 
be performed. Any cell that has not been covered by the M2L 
and M2P interactions at this point will be handled by P2P 
interactions. When all the P2P interactions are finished the 
stack will be empty and Algorithm [T] will terminate. In the end 
this will result in a similar interaction list as the one described 
by Cheng et al. fTV\, but the implementation is much simpler 
and the algorithm itself is much more general and flexible. 

This framework allows 0{N) treewalks on not only Morton 
ordered square oct-trees, but also binary trees with rectangular 
ORB or any other hierarchical decomposition of the domain. 
It also enables the hybridization of treecodes and FMMs in the 
most natural way, and at the same time, adds the capability to 
auto-tune the kernel selection to optimize the performance on 
heterogenous architectures. 

C. Local Essential Tree 

Once the domain is partitioned, each process will own a 
local portion of the global tree structure. It is then necessary 
to communicate the local essential tree (LET), which is a sig- 
nificantly smaller subset of the global tree that contains exactly 
all the information that is necessary to perform the evaluation. 
Salmon and Warren [7J introduced two key techniques; one for 
finding which cells to send, and the other for communicating 
the data efficiently. 

The determination of which data to send is related to how 
the data will be used in the evaluation stage. How the data 
will be used can be predicted by redefining the multipole 
acceptance criteria (MAC) as the distance from the cell in 
the other process to the edge of the domain of the current 
process. This way the tree is traversed only once per process, 
which is a negligible cost compared to the actual tree traversal 
that is done per leaf cell. This method can be applied directly 
to our stack-based treewalk approach, since the LET itself is 
identical to a standard treecode. 

The communication of the LET requires global commu- 
nication and is a potential bottleneck for the scalability of 
hierarchical N-body algorithms. In order to communicate LET 
efficiently, the bisectors in the ORB can be used to repeatedly 
exchange the data required on the other side of the bisector 
This N-D hypercube or All-reduce-like communication is used 
in state-of-the-art treecodes |5| and FMMs |4| to this day. 
The present implementation combines this technique with the 
recursive multisection described in section III-AI to allow the 
same efficient communication when the number of processes 
is not a power of two. We also extend this entire framework 
to the case of periodic boundary conditions. 
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Fig. 1: Interaction list for a periodic EMM 



D. Periodic FMM 

The FMM originally solves potential problems with free- 
field boundary conditions. However, the method can be ex- 
tended to handle periodic boundary conditions by placing 
periodic images around the original domain and clustering 
them into large cells, fV2\ as shown in Fig. [T] There are 
two reasons why periodic FMMs add almost no computational 
overhead to the original FMM. The first is the fact that further 
domains are clustered into larger and larger cells, so the 
extra cost of considering another layer of periodic images is 
constant. The second reason is that only the sources need to be 
duplicated and the evaluation points exist only in the original 
domain. Since the work load for considering the periodicity is 
independent of the number of particles, it becomes negligible 
as the problem size increases. We have confirmed in an 
earlier study that periodic boundary conditions adds 3 % more 
calculation time for a million particles. Ill3l 

The current implementation of the periodic FMM uses each 
bit of an integer to store the information of the periodic 
images. For convenience, let us refer to this integer as the 
"periodic flag". When considering the 26 periodic images 
surrounding the original domain, one traverses the tree 27 
times (including the original domain) and flips the bit of the 
periodic flag if it satisfies the MAC. In the current FMM, the 
tree traversal is separated from the evaluation. Interactions are 
queued and evaluated in one batch after the tree traversal is 
finished because this is an efficient model for evaluating the 
kernels on the GPU as we will describe in section ITLEI Once 
all 27 tree traversals are finished, the kernels are evaluated in 
the same manner as a non-periodic FMM, but the loop over the 
source cells is performed 27 times with a conditional statement 
for the periodic flag inside. The use of a periodic flag results in 
minimum modification to the tree construction, traversal, LET 
communication, and evaluation for parallel periodic FMM on 
GPUs. We also make use of the C++ STL map container to 
store the periodic flags with the cell index as the key so that 
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Fig. 2: Flow of treecode/FMM and definition of their kernels. 



collisions of periodic flags for the same cell are automatically 
resolved. 

E. GPU kernels 

Our hybrid hierarchical N-body algorithm has 7 different 
kernels, as shown in Fig. |2] All of these kernels are evaluated 
on the GPU. Out of these 7 kernels, a majority of the time is 
spent on the P2P, M2L, and M2P kernels. These three kernels 
are evaluated in one batch after the tree traversal is completed, 
since there is no data dependency between these three kernels. 
This batch evaluation can be broken down into multiple calls 
to the GPU depending on the data size. Therefore we were 
able handle problem sizes of up to 100 million on a single 
process if the memory on the host machine was large enough. 

m 

As an example of the usage of thread blocks in the GPU 
execution model, we show an illustration of the M2L kernel 
on GPUs in Fig. [3] Each coefficient of the multipole/local 
expansion is mapped to a thread on the GPU, and each target 
cell is mapped to a thread block while each source cell is 
loaded to shared memory and evaluated sequentially. All other 
kernels are mapped to the threads and thread blocks in a 
similar manner. 

III. Integration of FMM into Turbulence 
Simulations 

A. Vortex Method 

The vortex method |15 | is a particle based method for fluid 
dynamics simulations. The particle based discretization allows 
the continuum physics to be solved as a A^-body problem. 
Therefore, the hierarchical A^-body methods that extract the 
full potential of GPUs can be used for the simulation of 
turbulence. Unlike other particle based fluid dynamics solvers 
e.g. SPH|16|, the vortex method is especially well suitable 
for solving turbulence, because the vortex interactions seen 
in turbulent flows are exactly what it calculates using the 
interacting vortex particles. 

Since the vortex method is neither a standard method for 
simulating turbulence or a standard application for FMMs, 
we will give a brief explanation of the method itself. In the 
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Fig. 3: Execution of M2L kernel on GPUs. 



vortex method, the Navier-Stokes equation is solved in the 
velocity-vorticity form, and discretized with vortex particles. 
The velocity is calculated by 



N 



X VG 



(1) 



where a is the strength of vortex particles. G = 1/Anrij 
the Green's function for the Laplace equation and 
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(2) 



is the cutoff function, where r is the distance between the 
interacting particles, and a is the standard deviation of the 
Gaussian function. The vorticity equation is solved in a 
fractional step manner by calculating the stretching 
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and the diffusion 
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separately. We perform a radial basis function interpolation for 
reinitialized Gaussian distributions to ensure the convergence 
of the diffusion calculation. 

Equations ([T]) and Q are N-body interactions, and are eval- 
uated using the FMM. The flow of the entire vortex method 
is shown in Fig. |4] First, the domain is partitioned using 
the recursive multisection described in section III-AI Then the 
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Fig. 4: Flow of vortex method calculation with FMM 



FMM kernels for the Local Tree are evaluated while the LET 
is being communicated in a separate openmp section. After the 
LET is communicated, the FMM kernels are evaluated again 
for the remaining parts of the LET. Subsequently, the position, 
vortex strength a, and core radius a are updated locally. This 
information is communicated in the next round when the LET 
is exchanged. Furthermore, in Lagrangian CFD simulations 
it is necessary to reinitialize the particle positions so the 
distribution of particles remains somewhat even. This actually 
allows us to reuse the same tree structure since the particles 
are reinitialized to the same position every time. Therefore, 
the partitioning is performed only once in this simulation. 

B. Spectral Method 

For the spectral method calculation, we used a standard 
code for homogeneous isotropic turbulence called HIT3D. 
The code is available from google code and uses a spectral 
Galerkin method with primitive variable formulation with 
pseudo-spectral methods to compute the convolution sums. A 
schematic of the parallelization of the EFT is shown in Fig. 
|5] The aliasing was removed by the 2/3-rule and the time 
integration was performed by a 2nd order Adams-Bashforth 
method. No forcing was applied since it would be difficult to 
do so with vortex methods. 

The initial condition was generated in Fourier space as a 
solenoidal isotropic velocity field with random phases and a 
prescribed energy spectrum. This initial velocity field has a 
Gaussian PDF and satisfies the incompressibility condition. 
The vortex method uses the same initial condition by first 
calculating the vorticity field in physical space, and then 
using radial basis function interpolation to obtain the vortex 
strengths. 
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C. Calculation of Isotropic Turbulence 

The calculation of homogeneous isotropic turbulence was 
performed for an initial Reynolds number of Re\ w 500. 
The calculation domain is [— 7r,7r] and has periodic boundary 
conditions in all directions, and the grid size was 2048'^. 
This results in an N-body simulation of a total of 8.6 billion 
particles. The total number of GPUs used was 2048. For the 
periodic FMM the number periodic images was S'^ in each 
dimension. The order of multipole expansion was set iop = 14 
to achieve the required accuracy for the present application. 

The isosurface of the second invariant of the velocity 
gradient tensor is shown in Fig. |6] This is a snapshot at the 
early stages of the simulation and we do not observe any 
large coherent structures. In order to take a closer look at 
the quantitative aspects of the vortex simulation, we compare 
the kinetic energy spectrum with that of the spectral method 
in Fig. [7] where T is the eddy turnover time. It is quite clear 
that we have an excellent quantitative agreement between the 
vortex method and spectral method. Therefore, we conclude 
that our FMM based particle method is capable of simulating 
the same scale of turbulence as the spectral method with the 




Fig. 6: Isosurface of the second invariant of the velocity 
gradient tensor 
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Fig. 8: Weak Scaling for 4 million particles per process. 



Voltaire Inc., with non-blocking and full bisectional band- 
width. Each node has 2 x 40 Gbps bandwidth, and the bisection 
bandwidth of the system is over 200 Tbps. The total number of 
M2050 GPUs in the system is 4224, and the peak performance 
of the entire system is 2.4 Petaflop/s. 



B. Weak Scaling 
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Fig. 7: Kinetic Energy Spectrum at t/T=2 



same spatial resolution. We will turn our attention to the speed 
of the calculation in the next section. 

IV. Scalability Results 

A. Hardware 

The present calculations are run on the TSUBAME 2.0 sys- 
tem, which has 1408 nodes equipped with 12-core Westmere- 
EP 2.93GHz CPUs, 3 NVIDIA M2050 GPUs, 54 GB (96 GB 
on 41 nodes) of RAM, and 120 GB (240 GB on 41 nodes) 
of local SSD storage. Each computing node is interconnected 
with the InfiniBand device Grid Director 4700 developed by 



The results of the weak scaling test with 4 million particles 
per process is shown in Fig. [8] The local evaluation is the P2P 
kernel evaluation in Fig.|2] and the FMM evaluation is the sum 
of all the other kernel evaluations. The MPI communication is 
overlapped with the kernel evaluations but we decided to show 
the time for the communication by subtracting an equivalent 
area from the FMM evaluation time. Therefore, the total height 
of the bar correctly represents the total wall clock time of the 
calculation. The GPU communication time is the time spent on 
the cudaMemcpy. We plan to reduce this portion in the future 
by performing asynchronous memory transfers with double 
buffering. Finally, the tree construction represents the time 
required to construct the tree and sort the particles. From Fig. 
|8] we see that the current FMM is able to completely hide the 
communication up to 2048 GPUs. 

The parallel efficiency of the weak scaling test is shown 
in Fig. |9] We also performed a similar weak scaling test for 
the spectral method. The parallel efficiency of the FMM is 72 
% at 2048 GPUs, while the parallel efficiency of the spectral 
method is 27 % on 2048 CPUs. The botdeneck of the spectral 
method is the aU-to-all communication for transposing the 
slabs into pencils as shown in Fig. |5] Even though this is not 
the best implementation of a parallel EFT, the difference in 
the scalability between the spectral method and FMM is quite 
obvious. We would also like to note that the actual calculation 
time for one time step was approximately 30s for both the 
FMM and spectral method for 2048 processes. Therefore, the 
superior scalability of the FMM has merely closed the gap. 
However, we anticipate that this trend will affect the algorithm 
of choice if we were to move to architectures with higher 
degree of parallelism. 
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Fig. 9: Comparison of Weak Scaling between FMM and" 
spectral method 
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TABLE I: Floating point operations per interaction 

26 



Operation 


Biot-Savart 


Stretching 


+ 


19 


25 




14 


18 




32 


56 


/ 


1 


1 


sqrtf 


1 


1 


rsqrtf 


1 


1 


expf 


2 


2 


Total 


70 


104 



C. Performance Highlights 

The present calculation is dominated by the floating point 
operations in the particle-particle interaction and all other parts 
are negligible in terms of Flops. Two separate equations are 
calculated for the particle-particle interaction; the Biot-Savart 
equation ([T]l and the Stretching equation Q. The CUD A 
source code for the Biot-Savart and Stretching kernel are 
shown in Listings [T] and [2] respectively. 

The equation to calculate the Flops for the P2P kernel is 





(Processes) 


X 


{Target particles per process) 


X 


{Source cells per target) 


X 


{Source particle per cell) 


X 


{Flops per interaction) 


/ 


{Wall clock time) 




2048 X (4 X 10^) X 19 X 488 x 174/26 




4.93 X 10^-* = 493 Teraflop/s 



Compared to last years peak performance winner (0.7 
Petaflop/s), we are able to achieve similar performance (0.5 
Petaflop/s) on a problem size that is an order of magnitude 
smaller (90 billion vs. 8.6 biilion), on a GPU cluster using 
only 1/100 the number of MPI processes (200,000 AMD cores 



Listing 1: P2P kernel for a single Biot-Savart interaction 

device inline void BiotSavartP2P_core {float *target, 

float *targetX, float *sourceShrd, floats d, int i) { 

d.x += targetX [0] ; 

d.x -= sourceShrd [ 7 *i+0 ] ; 

d.y += targetX [1] ; 

d.y -= sourceShrd [ 7 *i+l ] ; 

d.z += targetX [2] ; 

d.z -= sourceShrd [ 7 *i+2 ] ; 

const float SQRT4PI = M_2_SQRTPI; 

const float FOURPI = 0.25 * M_1_PI; 

float R2 = d.x * d.x + d.y * d.y + d.z * d.z + EPS2; 

float SQRT_R2_1 = rsqrtf (R2); 

float RS = R2 * sourceShrd [7*1+6] ; 

float SQRT_RS = sqrtf (RS); 

float z = SQRT_RS,t,r; 

(t) =1 . Of / (1 . Of + . 5f * (z) ) ; 

(r) = (t) *expf (-(z)*(z)-1.26551223f 

+ (t)*(l. 00002368 f+(t)*(0.37409196f 

+ (t)*(0. 09678418 f+(t)*(-0.18628806f 

+ (t)*(0. 27886807 f+(t)*(-1.1352G398f 

+ (t)*(l. 48851587 f+(t)*(-0.82215223f 

+ (t) *0 . 17087277f ) )))))))); 
float cutoff = FOURPI * SQRT_R2_1 

* SQRT_R2_1 * SQRT_R2_1 * ( l.Of - r 
- SQRT4PI * SQRT_RS * expf(-RS)); 
target[0] += (d.y * sourceShrd [ 7 *i+5 ] 

- d.z * sourceShrd [ 7 *i+4 ] ) * cutoff; 
target[l] += (d.z * sourceShrd [ 7 *i+3 ] 

- d.x * sourceShrd [ 7 *i+5 ] ) * cutoff; 
target[2] += (d.x * sourceShrd [ 7 *i+4 ] 

- d.y * sourceShrd [ 7 *i+3 ] ) * cutoff; 

} 



VS. 2,000 GPUs). 

Compared to the price/performence winner of 2009 and 
Honorable mention for price/performance of 2010, we are able 
to achieve 5 times more Flop/s on 3.5 times more GPUs. 

The present vortex method simulation of 8.6 billion particles 
is also the largest particle-based CFD simulation to date as far 
as the authors are aware. 

The full system of TSUBAME 2.0 offers 4224 GPUs, but 
we were only able to use about half of the nodes at this 
time. We hope to perform a full node run by the time of the 
conference and we expect to exceed a PetaFlop with further 
tuning. The tuning will include overlap of GPU buffering/com- 
munication with GPU computations by using asynchronous 
data transfers and double buffering. Also, the full node run 
will allow us to perform a 4096"^ particle run instead of the 
current 2048'^, which will most likely lead to better scaling. 

V. Conclusions 

This work represents several milestones. Although the FMM 
algorithm has been taken to Petascale before (notably, with 
last year's winner of the Gordon Bell prize), the present work 
represents the first time that this is done on GPU architecture. 
Also, to our knowledge, the present work is the largest 
direct numerical simulation with vortex methods to date, with 
8.6 billion particles used in the cubic volume. Yet another 
significant event is reaching a range where the highly scalable 
FMM starts showing advantage over FFT-based algorithms. 



Listing 2: P2P kernel for a single Stretching interaction 

1 device inline void StretchingP2P_core ( 

2 float ^target, float *targetX, float *targetQ, 

3 float *sourceShrd, floats d, int i) { 

4 d.x += targetX [0] ; 

5 d.x -= sourceShrd [ 7 *i + ] ; 

6 d.y += targetX[l] ; 

7 d.y -= sourceShrd [ 7 *i+l ] ; 

8 d.z += targetX[2] ; 

9 d.z -= sourceShrd [ 7 *i+2 ] ; 

10 float R2 = d.x * d.x + d.y * d.y + d.z * d.z + EPS2; 
n const float SQRT4PI = M_2_SQRTPI; 

12 const float FOURPI = 0.25 * M_1_PI; 

13 float SQRT_R2_1 = rsqrtf(R2); 

14 float RS = R2 * sourceShrd[7*i + 6] ; 

15 float SQRT_RS = sqrtf(RS); 

16 float z = SQRT_RS, t, ERF_SQRT_RS; 

17 (t) =1 . Of / (1 . Of + . 5f * (z) ) ; 

IS ERF_SQRT_RS = 1 . Of - ( t ) *expf (-( z )*( z ) -1 . 2 655122 3f 

19 +(t)*(l. 00002368 f+(t)*(0.37409196f 

20 +(t)*(0. 09678418 f+(t)*(-0.18628806f 

21 +(t)*(0. 27886807 f+(t)*(-1.13520398f 

22 +(t)*(l. 48851587 f+(t)*(-0.82215223f 

23 + (t) *0 . 17087277f ) )))))))); 

24 float EXP_RS = expf(-RS); 

25 float cutoff = FOURPI * SQRT_R2_1 

26 * SQRT_R2_1 * SQRT_R2_1 * (ERF_SQRT_RS 

27 - SQRT4PI * SQRT_RS * EXP_RS) ; 

28 target[0] += (targetQ[l] * sourceShrd [ 7 *i + 5 ] 

29 - targetQ[2] * sourceShrd [7*1 + 4 ] ) 

30 * cutoff; 

31 target[l] += (targetQ[2] * sourceShrd [ 7 *i + 3 ] 

32 - targetQ[0] * sourceShrd [ 7 *i + 5 ] ) 

33 * cutoff; 

34 target[2] += (targetQ[0] * sourceShrd [7*1 + 4 ] 

35 - targetQ[l] * sourceShrd [ 7 *i + 3 ] ) 

36 * cutoff; 

37 float cutoff2 = FOURPI * SQRT_R2_1 

38 * SQRT_R2_1 * SQRT_R2_1 

39 * SQRT_R2_1 * SQRT_R2_1 

40 * (3. Of * ERF_SQRT_RS - (2. Of * RS + 3. Of) 

41 * SQRT4PI * SQRT_RS * EXP_RS) 

42 * (targetQ[0] * d.x + targetQ[l] * d.y 

43 + targetQ [2] * d.z); 

44 target[0] += (sourceShrd [7*1 + 4 ] * d.z 

45 - sourceShrd [ 7 *i + 5 ] * d.y) * cutoff2; 

46 target[l] += ( sourceShrd [ 7 *i+5 ] * d.x 

47 - sourceShrd [ 7 *i + 3 ] * d.z) * cutoff2; 

48 target[2] += ( sourceShrd [ 7 *i+3 ] * d.y 

49 - sourceShrd [7*1 + 4 ] * d.x) * cutoff2; 



50 } 



With a 0.5 Petaflop/s calculation of isotropic turbulence in 
a 2048"^ box, using 2000 GPUs, we are within reach of a 
turning point. The combination of application, algorithm, and 
hardware used are also notable. 
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